In high-performance computing environments, input/output (I/O) from various sources often contends for scarce available bandwidth. In addition to the I/O operations inherent to the failure-free execution of an application, I/O from checkpoint/restart (CR) operations (used to ensure progress in the presence of failures) places an additional burden, as it increases I/O contention and leads to degraded performance. In this work, we consider a cooperative scheduling policy that optimizes the overall performance of concurrently executing CR-based applications which share valuable I/O resources. First, we provide a theoretical model and then derive a set of necessary constraints needed to minimize the global waste on the platform. Our results demonstrate that the optimal checkpoint interval as defined by Young/Daly, while providing a sensible metric for a single application, is not sufficient to optimally address resource contention at the platform scale. We therefore show that combining optimal checkpointing periods with I/O scheduling strategies can provide a significant improvement in overall application performance, thereby maximizing platform throughput. Overall, these results provide critical analysis and direct guidance on checkpointing large-scale workloads in the presence of competing I/O while minimizing the impact on application performance.
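For readers unfamiliar with the Young/Daly interval mentioned above, the following is a minimal sketch of the standard first-order, single-application formula, in which the checkpointing period is sqrt(2 * C * mu) for a checkpoint cost C and a mean time between failures mu. It does not reproduce the platform-level model or constraints of this work. The function names and the numeric values are hypothetical, chosen only for illustration, and downtime and recovery costs are ignored.

```python
import math


def young_daly_period(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order Young/Daly checkpointing period, sqrt(2 * C * mu).

    checkpoint_cost_s: time to write one checkpoint (C), in seconds.
    mtbf_s: mean time between failures seen by the application (mu), in seconds.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def first_order_waste(period_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximate fraction of time wasted with period T: C/T + T/(2 * mu).

    The first term is checkpoint overhead; the second is the expected
    re-execution after a failure. Downtime and recovery are ignored here.
    """
    if period_s <= 0:
        raise ValueError("period must be positive")
    return checkpoint_cost_s / period_s + period_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    # Hypothetical values: a 60 s checkpoint and a one-day MTBF.
    cost, mtbf = 60.0, 86400.0
    period = young_daly_period(cost, mtbf)
    print(f"period: {period:.1f} s, waste: {first_order_waste(period, cost, mtbf):.4f}")
```

This period is computed for one application in isolation, with a fixed checkpoint cost. When several applications checkpoint concurrently and share I/O bandwidth, the effective cost of each checkpoint depends on what the others are doing, which is the situation the cooperative scheduling policy is meant to address.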